Education in India¶

IBM Exploratory Data Analysis for Machine Learning: Honors Peer-graded Assignment¶


Context¶

This notebook is the week-5 Honors project for the IBM Machine Learning course Exploratory Data Analysis for Machine Learning. It is a proof of concept for the techniques taught in that course. The data is the official dataset released by the Govt. of India, based on the 2001 and 2011 censuses.

Content¶

The data covers 35 Indian states and union territories. Literacy rates are reported for three groups: overall (total), rural and urban. Every value is a percentage of the population of that state or union territory.

About dataset¶

The CSV file contains data from the Govt. of India website on the literacy rate of the 35 states and union territories. There are 3 key fields: overall literacy rate, urban literacy rate and rural literacy rate.

Inspiration¶

Understand the literacy rate in India and identify which states/UTs show the highest growth in literacy rate.

Table of contents¶

1. About the dataset.

2. Total Literacy rate across nation.

3. Rural literacy rate across nation.

4. Urban literacy rate across nation.

5. State vs. Union territories.

6. Literacy Rate in each State/ Union Territory

7. A simple hypothesis testing

Libraries¶

In [1]:
import numpy as np
import pandas as pd

import scipy.stats as stats

import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly as ply

pio.templates.default = "plotly_dark"

About the dataset¶

In [2]:
df = pd.read_csv("GOI.csv")
df.head(10)
Out[2]:
Category Country/ States/ Union Territories Name Literacy Rate (Persons) - Total - 2001 Literacy Rate (Persons) - Total - 2011 Literacy Rate (Persons) - Rural - 2001 Literacy Rate (Persons) - Rural - 2011 Literacy Rate (Persons) - Urban - 2001 Literacy Rate (Persons) - Urban - 2011
0 Country INDIA 64.8 73.0 58.7 67.8 79.9 84.1
1 State Andhra Pradesh 60.5 67.0 54.5 60.4 76.1 80.1
2 State Arunachal Pradesh 54.3 65.4 47.8 59.9 78.3 82.9
3 State Assam 63.3 72.2 59.7 69.3 85.3 88.5
4 State Bihar 47.0 61.8 43.9 59.8 71.9 76.9
5 State Chhattisgarh 64.7 70.3 60.5 66.0 80.6 84.0
6 State Goa 82.0 88.7 79.7 86.6 84.4 90.0
7 State Gujarat 69.1 78.0 61.3 71.7 81.8 86.3
8 State Haryana 67.9 75.6 63.2 71.4 79.2 83.1
9 State Himachal Pradesh 76.5 82.8 75.1 81.9 88.9 91.1
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 8 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Category                                 36 non-null     object 
 1   Country/ States/ Union Territories Name  36 non-null     object 
 2   Literacy Rate (Persons) - Total - 2001   36 non-null     float64
 3   Literacy Rate (Persons) - Total - 2011   36 non-null     float64
 4   Literacy Rate (Persons) - Rural - 2001   36 non-null     float64
 5   Literacy Rate (Persons) - Rural - 2011   36 non-null     float64
 6   Literacy Rate (Persons) - Urban - 2001   36 non-null     float64
 7   Literacy Rate (Persons) - Urban - 2011   36 non-null     float64
dtypes: float64(6), object(2)
memory usage: 2.4+ KB

Note: The column names contain the word Persons. I assume this means the literacy rate is documented per 100 persons, i.e. as a percentage.

We have data for two census years, 2001 and 2011, which are a decade apart. We can derive a new attribute that shows the relative change in literacy rate over the decade.

In [4]:
# Change over the decade: (2011 value - 2001 value) / 2001 Total value.
# NOTE: the Rural and Urban changes are divided by the 2001 *Total* rate, not by
# their own 2001 rate; the outputs shown in this notebook reflect that choice.
df['Total - Per. Change'] = (df.loc[:,'Literacy Rate (Persons) - Total - 2011'] - 
                df.loc[:,'Literacy Rate (Persons) - Total - 2001'])/df.loc[:,'Literacy Rate (Persons) - Total - 2001']
df['Rural - Per. Change'] = (df.loc[:,'Literacy Rate (Persons) - Rural - 2011'] - 
                df.loc[:,'Literacy Rate (Persons) - Rural - 2001'])/df.loc[:,'Literacy Rate (Persons) - Total - 2001']
df['Urban - Per. Change'] = (df.loc[:,'Literacy Rate (Persons) - Urban - 2011'] - 
                df.loc[:,'Literacy Rate (Persons) - Urban - 2001'])/df.loc[:,'Literacy Rate (Persons) - Total - 2001']
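
The cell above divides all three differences by the 2001 total rate. A common alternative is to divide each difference by its own 2001 value. The sketch below is a minimal plain-Python illustration of that alternative, using only the country-level row from the table above; the `india` dictionary and the `relative_change` helper are hypothetical names introduced here and are not part of the notebook.

```python
# Country-level literacy rates (percent), copied from the INDIA row of the table:
# group -> (2001 rate, 2011 rate)
india = {
    "Total": (64.8, 73.0),
    "Rural": (58.7, 67.8),
    "Urban": (79.9, 84.1),
}

def relative_change(old, new):
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old

# Each group is normalised by its own 2001 rate, then scaled to percent.
pct_change = {
    group: 100 * relative_change(rate_2001, rate_2011)
    for group, (rate_2001, rate_2011) in india.items()
}

for group, pct in pct_change.items():
    print(f"{group}: {pct:.2f}%")
```

With this definition the total change is identical to the notebook's value, while the rural and urban changes differ because their denominators differ.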

The column names are too long, so I will strip everything that comes before Total - 'year', Rural - 'year' and Urban - 'year'.

In [5]:
# Keep only the text after '(Persons) - '; names without that marker are unchanged.
df.columns = [col.split('(Persons) - ')[-1] for col in df.columns]

df.head()
Out[5]:
Category Country/ States/ Union Territories Name Total - 2001 Total - 2011 Rural - 2001 Rural - 2011 Urban - 2001 Urban - 2011 Total - Per. Change Rural - Per. Change Urban - Per. Change
0 Country INDIA 64.8 73.0 58.7 67.8 79.9 84.1 0.126543 0.140432 0.064815
1 State Andhra Pradesh 60.5 67.0 54.5 60.4 76.1 80.1 0.107438 0.097521 0.066116
2 State Arunachal Pradesh 54.3 65.4 47.8 59.9 78.3 82.9 0.204420 0.222836 0.084715
3 State Assam 63.3 72.2 59.7 69.3 85.3 88.5 0.140600 0.151659 0.050553
4 State Bihar 47.0 61.8 43.9 59.8 71.9 76.9 0.314894 0.338298 0.106383
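
To see why the renaming shortens only the literacy-rate columns, the following minimal sketch applies the same `split` call to three of the column names. `shorten` is a hypothetical helper name used only for this illustration.

```python
def shorten(name):
    # Text after the last '(Persons) - ' marker; the whole name if the marker is absent.
    return name.split('(Persons) - ')[-1]

examples = [
    'Literacy Rate (Persons) - Total - 2001',
    'Category',
    'Total - Per. Change',
]

shortened = [shorten(name) for name in examples]
print(shortened)
```

Because `str.split` returns a one-element list when the separator is not found, taking the last element leaves names such as `Category` untouched.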

The data covers the whole country as well as the individual states and union territories. I will first look at the country-level literacy rates and then remove that row from the dataset, so that it is easier to compare literacy rates among states and union territories.

In [6]:
India = df[df['Category'] == 'Country'].T
India = India.iloc[2:8,:]
India.reset_index(inplace=True)
India.columns = ['Measure', 'Value']
India.loc[:,'Measure'] = India['Measure'].apply(lambda x : str(x).split(' -')[0])
India_2001 = India.iloc[[0,2,4],:]
India_2011 = India.iloc[[1,3,5],:]
In [7]:
fig = go.Figure(data=[
    go.Bar(name='2001', x=India_2001['Measure'], y=India_2001['Value'], marker_color='rgb(55, 83, 109)'),
    go.Bar(name='2011', x=India_2011['Measure'], y=India_2011['Value'], marker_color='rgb(26, 118, 255)')
])
fig.update_layout(yaxis_range=[0, 100],barmode='group', title='Overall Literacy Rate in India :',yaxis_title="Literacy rate (%)")
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')

fig.show()

Observation:¶

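  • The total literacy rate in India increased by 8.2 percentage points over the decade (64.8 to 73.0). That is a relative increase of 12.65% over the 2001 value.
  • The literacy rate in rural India increased by 9.1 percentage points over the decade (58.7 to 67.8). The Rural - Per. Change column reports this as 14.04% because it divides by the 2001 total rate; relative to the 2001 rural rate the increase is 15.50%.
  • The literacy rate in urban India increased by 4.2 percentage points over the decade (79.9 to 84.1). The Urban - Per. Change column reports this as 6.48%; relative to the 2001 urban rate the increase is 5.26%.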

We have three attributes for literacy rate: total, rural and urban. We'll look at each of them to see how they are distributed across the nation.

In [8]:
df = df.iloc[1:,:].copy()  # Remove the country-level row (India); copy so later in-place edits do not act on a slice.
df = df.rename(columns={'Country/ States/ Union Territories Name' :'States/ Union Territories'})
df.head()
Out[8]:
Category States/ Union Territories Total - 2001 Total - 2011 Rural - 2001 Rural - 2011 Urban - 2001 Urban - 2011 Total - Per. Change Rural - Per. Change Urban - Per. Change
1 State Andhra Pradesh 60.5 67.0 54.5 60.4 76.1 80.1 0.107438 0.097521 0.066116
2 State Arunachal Pradesh 54.3 65.4 47.8 59.9 78.3 82.9 0.204420 0.222836 0.084715
3 State Assam 63.3 72.2 59.7 69.3 85.3 88.5 0.140600 0.151659 0.050553
4 State Bihar 47.0 61.8 43.9 59.8 71.9 76.9 0.314894 0.338298 0.106383
5 State Chhattisgarh 64.7 70.3 60.5 66.0 80.6 84.0 0.086553 0.085008 0.052550

Total Literacy Rate Across Nation:¶

In [9]:
df.sort_values(by='Total - 2001', inplace=True)

fig = go.Figure(data=[
    go.Bar(name='2001', x=df['Total - 2001'], y=df['States/ Union Territories'], orientation='h', marker_color='rgb(255, 0, 96)'),
    go.Bar(name='2011', x=df['Total - 2011'], y=df['States/ Union Territories'], orientation='h', marker_color='rgb(0, 223, 162)')
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Total Literacy Rate Across Nation :', yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Total Literacy Rate")
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')

fig.show()

Top 5 highest and lowest "Total literacy rate" in 2001 across India¶

In [10]:
lowest_2001 = df.sort_values(by=['Total - 2001']).head()
highest_2001 = df.sort_values(by=['Total - 2001']).tail()

fig = go.Figure(data = [
    go.Bar(name = 'Lowest_2001', x=lowest_2001['Total - 2001'], y=lowest_2001['States/ Union Territories'],orientation='h', marker_color='rgb(246, 250, 112)'),
    go.Bar(name = 'Highest_2001', x=highest_2001['Total - 2001'], y=highest_2001['States/ Union Territories'],orientation='h', marker_color='rgb(0, 121, 255)')
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = ' Top 5 highest and lowest "Total literacy rate" in 2001 :', xaxis_title = "Total Literacy Rate in 2001",
                 yaxis_title = "States/ Union Territories")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')

fig.show()

Top 5 highest and lowest "Total literacy rate" in 2011 across India¶

In [11]:
lowest_2011 = df.sort_values(by=['Total - 2011']).head()
highest_2011 = df.sort_values(by=['Total - 2011']).tail()

fig = go.Figure(data = [
    go.Bar(name = 'Lowest_2011', x=lowest_2011['Total - 2011'], y=lowest_2011['States/ Union Territories'],orientation='h', marker_color='rgb(246, 250, 112)'),
    go.Bar(name = 'Highest_2011', x=highest_2011['Total - 2011'], y=highest_2011['States/ Union Territories'], orientation='h', marker_color='rgb(0, 121, 255)')
])


fig.update_layout(xaxis_range=[0, 100],barmode='group', title = ' Top 5 highest and lowest "Total literacy rate" in 2011 :', xaxis_title = "Total Literacy Rate in 2011",
                 yaxis_title = "States/ Union Territories")


fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.show()
In [12]:
px.bar(df.sort_values(by='Total - Per. Change'),
       x='Total - Per. Change', y='States/ Union Territories',
       color='Total - Per. Change', title='Total Per. Change')

Observation :¶

  • Bihar, Jharkhand, Arunachal Pradesh, Jammu & Kashmir and Uttar Pradesh were the five least literate states/union territories in 2001.
  • Kerala, Mizoram, Lakshadweep, Goa and Chandigarh were the five most literate states/union territories in 2001.
  • In 2011, Rajasthan and Andhra Pradesh fell into the five least literate states/union territories alongside Bihar, Arunachal Pradesh and Jharkhand, whereas Jammu & Kashmir and Uttar Pradesh moved out of the bottom five.
  • Tripura rose into the five most literate states/union territories in 2011, alongside Kerala, Lakshadweep, Mizoram and Goa.
  • Mizoram, Kerala, Chandigarh, NCT of Delhi and Puducherry have the smallest percentage increase in literacy rate.
  • The percentage increase in total literacy is highest in D & N Haveli, Bihar, Jharkhand, Jammu & Kashmir and Arunachal Pradesh.
  • In 2001, 13 states/union territories had a literacy rate below the overall Indian literacy rate.
  • In 2011, 11 states/union territories had a literacy rate below the overall Indian literacy rate. Meghalaya and D & N Haveli increased their literacy rate enough to leave this group.
  • Bihar, Jharkhand, Arunachal Pradesh, Jammu & Kashmir, Uttar Pradesh, Rajasthan, Andhra Pradesh, Odisha, Assam, Madhya Pradesh and Chhattisgarh still have a total literacy rate below the overall literacy rate of the country.
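
The counts of states below the national rate come from a simple comparison against the country-level value. The sketch below shows that comparison in plain Python over the nine state rows visible in the table preview near the top of the notebook, so its counts apply to that subset only and not to the full dataset. The variable names are illustrative.

```python
# (state, total 2001, total 2011) for the nine states shown in the preview table.
states = [
    ("Andhra Pradesh", 60.5, 67.0),
    ("Arunachal Pradesh", 54.3, 65.4),
    ("Assam", 63.3, 72.2),
    ("Bihar", 47.0, 61.8),
    ("Chhattisgarh", 64.7, 70.3),
    ("Goa", 82.0, 88.7),
    ("Gujarat", 69.1, 78.0),
    ("Haryana", 67.9, 75.6),
    ("Himachal Pradesh", 76.5, 82.8),
]

# Country-level total literacy rates from the INDIA row.
national_2001, national_2011 = 64.8, 73.0

# Names of the states whose rate is below the national rate in each year.
below_2001 = [name for name, r01, _ in states if r01 < national_2001]
below_2011 = [name for name, _, r11 in states if r11 < national_2011]

print(len(below_2001), below_2001)
print(len(below_2011), below_2011)
```

On the notebook's DataFrame the same check could be written as a boolean mask, for example `df['Total - 2011'] < 73.0`.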

Rural literacy rate across India¶

In [13]:
df.sort_values(by='Rural - 2001', inplace=True)

fig = go.Figure(data = [
    go.Bar(name='2001', x=df['Rural - 2001'], y=df['States/ Union Territories'], orientation='h', marker_color='rgb(255, 0, 96)'),
    go.Bar(name='2011', x=df['Rural - 2011'], y=df['States/ Union Territories'],  orientation='h', marker_color='rgb(0, 223, 162)')
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Literacy rate in rural areas across the country :', yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Rural India Literacy Rate")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')

fig.show()

Top 5 highest and lowest "Rural literacy rate" in 2001 across India¶

In [14]:
lowest_2001 = df.sort_values(by=['Rural - 2001']).head()
highest_2001 = df.sort_values(by=['Rural - 2001']).tail()

fig = go.Figure(data = [
    go.Bar(name = 'Lowest_2001', x=lowest_2001['Rural - 2001'], y=lowest_2001['States/ Union Territories'],orientation='h', marker_color='rgb(246, 250, 112)'),
    go.Bar(name = 'Highest_2001', x=highest_2001['Rural - 2001'], y=highest_2001['States/ Union Territories'], orientation='h', marker_color='rgb(0, 121, 255)' )
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Top 5 highest and Lowest "Rural literacy rate" in 2001 :',
                 yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Rural India Literacy Rate in 2001")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.show()

Top 5 highest and lowest "Rural literacy rate" in 2011 across India¶

In [15]:
lowest_2011 = df.sort_values(by=['Rural - 2011']).head()
highest_2011 = df.sort_values(by=['Rural - 2011']).tail()

fig = go.Figure(data = [
    go.Bar(name = 'Lowest_2011', x=lowest_2011['Rural - 2011'], y=lowest_2011['States/ Union Territories'],orientation='h', marker_color='rgb(246, 250, 112)'),
    go.Bar(name = 'Highest_2011', x=highest_2011['Rural - 2011'], y=highest_2011['States/ Union Territories'], orientation='h', marker_color='rgb(0, 121, 255)' )
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Top 5 highest and Lowest "Rural literacy rate" in 2011 :',
                 yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Rural India Literacy Rate in 2011")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.show()
In [16]:
px.bar(df.sort_values(by='Rural - Per. Change',ascending=True),
       x='Rural - Per. Change', y='States/ Union Territories',
       color='Rural - Per. Change', title='Rural Per. Change')

Observations :¶

  • The distribution of rural literacy rate among states/union territories is similar to the one we saw for total literacy rate.

  • Bihar, Jharkhand, Jammu & Kashmir, D & N Haveli and Uttar Pradesh have the highest percentage increase in rural literacy rate, which suggests a strong effort in their rural areas.

  • Mizoram, Kerala, NCT of Delhi, Chandigarh and A & N Islands have the smallest percentage increase in rural literacy rate.

  • The states with the largest rural gains are the ones that had the lowest rural literacy rate in 2001.
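
The last observation can be quantified as a correlation between the 2001 rural rate and the gain over the decade. This is a minimal sketch over the nine state rows visible in the preview table, using the percentage-point gain and a hand-written Pearson coefficient (`pearson` is a hypothetical helper). A negative value supports the observation for this subset only; the full dataset may give a different figure.

```python
from math import sqrt

# state -> (rural 2001, rural 2011), copied from the preview table.
rural = {
    "Andhra Pradesh": (54.5, 60.4),
    "Arunachal Pradesh": (47.8, 59.9),
    "Assam": (59.7, 69.3),
    "Bihar": (43.9, 59.8),
    "Chhattisgarh": (60.5, 66.0),
    "Goa": (79.7, 86.6),
    "Gujarat": (61.3, 71.7),
    "Haryana": (63.2, 71.4),
    "Himachal Pradesh": (75.1, 81.9),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

levels = [r01 for r01, _ in rural.values()]          # starting level in 2001
gains = [r11 - r01 for r01, r11 in rural.values()]   # gain in percentage points

r = pearson(levels, gains)
print(f"correlation between 2001 rural rate and decade gain: {r:.2f}")
```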

Urban Literacy rate across the nation¶

In [17]:
df.sort_values(by='Urban - 2001', inplace=True)

fig = go.Figure(data = [
    go.Bar(name='2001', x=df['Urban - 2001'], y=df['States/ Union Territories'],  orientation='h', marker_color='rgb(255, 0, 96)'),
    go.Bar(name='2011', x=df['Urban - 2011'], y=df['States/ Union Territories'], orientation='h', marker_color='rgb(0, 223, 162)')
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Literacy rate in urban areas across the country :', yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Urban India Literacy Rate")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green')

fig.show()

Top 5 highest and lowest "Urban literacy rate" in 2001 across India¶

In [18]:
lowest_2001 = df.sort_values(by=['Urban - 2001']).head()
highest_2001 = df.sort_values(by=['Urban - 2001']).tail()

fig = go.Figure(data = [
    go.Bar(name = 'Lowest_2001', x=lowest_2001['Urban - 2001'], y=lowest_2001['States/ Union Territories'], orientation='h', marker_color='rgb(246, 250, 112)' ),
    go.Bar(name = 'Highest_2001', x=highest_2001['Urban - 2001'], y=highest_2001['States/ Union Territories'],orientation='h', marker_color='rgb(0, 121, 255)' )
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Top 5 highest and Lowest "Urban literacy rate" in 2001 :',
                 yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Urban India Literacy Rate in 2001")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.show()

Top 5 highest and lowest "Urban literacy rate" in 2011 across India¶

In [19]:
lowest_2011 = df.sort_values(by=['Urban - 2011']).head()
highest_2011 = df.sort_values(by=['Urban - 2011']).tail()

fig = go.Figure(data = [
    go.Bar(name = 'Lowest_2011', x=lowest_2011['Urban - 2011'], y=lowest_2011['States/ Union Territories'], orientation='h', marker_color='rgb(246, 250, 112)' ),
    go.Bar(name = 'Highest_2011', x=highest_2011['Urban - 2011'], y=highest_2011['States/ Union Territories'],orientation='h', marker_color='rgb(0, 121, 255)' )
])

fig.update_layout(xaxis_range=[0, 100],barmode='group', title = 'Top 5 highest and Lowest "Urban literacy rate" in 2011 :',
                 yaxis_title = "States/ Union Territories", 
                  xaxis_title = "Urban India Literacy Rate in 2011")

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.show()
In [20]:
px.bar(df.sort_values(by='Urban - Per. Change',ascending=True),
       x='Urban - Per. Change', y='States/ Union Territories',
       color='Urban - Per. Change', title='Urban Per. Change')

Observations :¶

  • Again, the distribution of urban literacy rate among states/union territories is similar to the one we saw for total literacy rate.
  • States/union territories that had a higher urban literacy rate in 2001 show a smaller percentage increase, while those that had a lower urban literacy rate show a larger one.

Note: I want the parameters ['Total - Per. Change', 'Rural - Per. Change', 'Urban - Per. Change'] expressed as percentages, so I will multiply their values by 100.

In [21]:
columns_to_multiply = ['Total - Per. Change', 'Rural - Per. Change', 'Urban - Per. Change']

df[columns_to_multiply] = df[columns_to_multiply] * 100
df.head()
Out[21]:
Category States/ Union Territories Total - 2001 Total - 2011 Rural - 2001 Rural - 2011 Urban - 2001 Urban - 2011 Total - Per. Change Rural - Per. Change Urban - Per. Change
26 State Uttar Pradesh 56.3 67.7 52.5 65.5 69.8 75.1 20.248668 23.090586 9.413854
4 State Bihar 47.0 61.8 43.9 59.8 71.9 76.9 31.489362 33.829787 10.638298
10 State Jammu & Kashmir 55.5 67.2 49.8 63.2 71.9 77.1 21.081081 24.144144 9.369369
1 State Andhra Pradesh 60.5 67.0 54.5 60.4 76.1 80.1 10.743802 9.752066 6.611570
22 State Rajasthan 60.4 66.1 55.3 61.4 76.2 79.7 9.437086 10.099338 5.794702

States vs Union Territories¶

In [22]:
# Mean of each measure per category (State vs. Union Territory), one row per measure.
measures = ['Total - 2001', 'Total - 2011', 'Rural - 2001', 'Rural - 2011', 'Urban - 2001', 'Urban - 2011']
temp = df.groupby(by=['Category'])[measures].mean().T.reset_index()
temp.columns=['Category','State','Union Territory']


fig = go.Figure(data = [
    go.Bar(name='States', y=temp['Category'], x=temp['State'], orientation='h', marker_color='rgb(26, 118, 255)'),
    go.Bar(name='Union Territories', y=temp['Category'], x=temp['Union Territory'], orientation='h', marker_color='rgb(55, 83, 109)')
])
fig.update_layout(barmode='group')
fig.show()

Observations:¶

The average literacy rate of union territories is higher than that of states for every measure (total, rural and urban) in both 2001 and 2011.

Literacy Rate in each State/ Union Territory¶

An interactive visualization of all states/ union territories and the observations.

In [23]:
df1 = pd.melt(df, id_vars='States/ Union Territories', value_vars=['Total - 2001', 'Total - 2011',
       'Rural - 2001', 'Rural - 2011', 'Urban - 2001', 'Urban - 2011','Total - Per. Change', 'Rural - Per. Change', 'Urban - Per. Change'])

# 'value' is numeric, so the colour is continuous and needs a continuous colour scale.
fig = px.bar(df1, x='variable', y='value', animation_frame='States/ Union Territories',color='value',
             color_continuous_scale='Viridis',title='Literacy Rate of each State/ Union Territory.'
            )

fig.update_layout(yaxis_range=[0, 100],xaxis_title = 'Measure' )
fig.update_traces(marker_line_color='rgb(8,48,107)', marker_line_width=1.5,texttemplate='%{value}', textposition='outside')
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')

fig.show()

A simple Hypothesis testing¶

Null Hypothesis ($H_{0}$): There is no difference between the mean total literacy rate in 2001 and in 2011 across India's states and union territories.
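
For reference, a paired t-test works on the per-state differences rather than on the two yearly samples separately. Writing $d_i$ for the 2011 rate minus the 2001 rate of state or union territory $i$, and $n$ for the number of states and union territories (notation introduced here, not used elsewhere in the notebook), the standard formulation is:

$$
H_0:\ \mu_d = 0, \qquad H_1:\ \mu_d \neq 0, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}
$$

Under $H_0$, and assuming the differences are approximately normal, $t$ follows a Student t distribution with $n-1$ degrees of freedom. Note that `stats.ttest_rel(data_2001, data_2011)` takes the differences in the opposite order, so its statistic has the opposite sign; the two-sided p-value is unaffected.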

In [24]:
# Paired samples: one 2001 value and one 2011 value per state/union territory
data_2001 = df['Total - 2001']
data_2011 = df['Total - 2011']

# Perform a paired t-test
t_stat, p_value = stats.ttest_rel(data_2001, data_2011)

alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between 2001 and 2011.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between 2001 and 2011.")
Reject the null hypothesis. There is a significant difference between 2001 and 2011.
In [25]:
# Paired samples: one 2001 value and one 2011 value per state/union territory
data_2001 = df['Total - 2001']
data_2011 = df['Total - 2011']

# Perform a paired t-test
t_stat, p_value = stats.ttest_rel(data_2001, data_2011)

alpha = 0.05

# Approximate 95% confidence interval of each mean: mean +/- 1.96 * standard error.
# px.bar expects the half-width of the error bar, not the absolute bounds.
results = pd.DataFrame({'Year': ['2001', '2011'],
                        'Mean': [data_2001.mean(), data_2011.mean()],
                        'CI_HalfWidth': [1.96 * data_2001.sem(), 1.96 * data_2011.sem()]})

# Create a Plotly figure with symmetric error bars
fig = px.bar(results, x='Year', y='Mean', error_y='CI_HalfWidth', title='Comparison of Means (2001 vs. 2011)')

# When the difference is significant, mark the 2001 mean as a reference line and show the test result
if p_value < alpha:
    fig.add_shape(type="line",
                  x0=-0.5, x1=1.5, y0=data_2001.mean(), y1=data_2001.mean(),
                  line=dict(color="red", dash="dash"), name="2001 mean")
    fig.add_annotation(x=0.5, y=data_2001.mean() + 3, text="2001 mean", showarrow=False, font=dict(color="green",size=16)
                      )
    # Paper coordinates keep these labels inside the plot area regardless of the y range
    fig.add_annotation(xref="paper", yref="paper", x=0.98, y=0.98, text=f"p-value: {p_value:.3f}", showarrow=False, font=dict(color="orange", size=14))
    fig.add_annotation(xref="paper", yref="paper", x=0.02, y=0.98, text=f"alpha: {alpha:.3f}", showarrow=False, font=dict(color="yellow", size=14))

fig.update_traces(marker_line_color='rgb(8,48,107)', marker_line_width=1.5,texttemplate='mean=%{y:.2f}',
                  textposition='inside', insidetextanchor='start', textfont=dict(color='black',size=14))
fig.update_layout(yaxis_range=[0, 100], xaxis_title = 'Total Literacy rate (2001-2011)' )
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='green',zeroline=True, zerolinewidth=1.5, zerolinecolor='green')

fig.show()

Observations:¶

The p-value rounds to 0.000 at three decimal places. It is not exactly zero, but it is far below alpha = 0.05, so the test gives very strong evidence against the null hypothesis. In a paired t-test this means that the mean of the per-state differences between the two years is significantly different from zero.

Specifically, comparing the data from 2001 and 2011, there is very strong statistical evidence that the mean total literacy rate differs between the two census years. In other words, the change from 2001 to 2011 is very unlikely to be due to chance alone.
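
As a cross-check on what the paired test measures, the statistic can be computed by hand from the per-state differences. The sketch below is an illustration under stated assumptions: it uses only the nine state rows visible in the preview table (not the full dataset), the standard library instead of scipy, and the tabulated two-sided 95% critical value 2.306 of the t distribution with 8 degrees of freedom. Its numbers therefore will not match the notebook's result on all states and union territories.

```python
from math import sqrt
from statistics import mean, stdev

# (total 2001, total 2011) for the nine states shown in the preview table.
pairs = [
    (60.5, 67.0), (54.3, 65.4), (63.3, 72.2),
    (47.0, 61.8), (64.7, 70.3), (82.0, 88.7),
    (69.1, 78.0), (67.9, 75.6), (76.5, 82.8),
]

diffs = [r11 - r01 for r01, r11 in pairs]  # per-state gain, 2011 minus 2001
n = len(diffs)

d_bar = mean(diffs)               # mean gain in percentage points
se = stdev(diffs) / sqrt(n)       # standard error of the mean gain (sample std, n - 1)
t_stat = d_bar / se               # paired t statistic with n - 1 degrees of freedom

# Assumption: tabulated critical value t(0.975, df=8) = 2.306.
T_CRIT_95_DF8 = 2.306
ci = (d_bar - T_CRIT_95_DF8 * se, d_bar + T_CRIT_95_DF8 * se)

print(f"mean gain = {d_bar:.2f}, t = {t_stat:.2f}")
print(f"95% CI for the mean gain: ({ci[0]:.2f}, {ci[1]:.2f})")
```

If the confidence interval for the mean gain excludes zero, that agrees with rejecting the null hypothesis at the 5% level for the rows used.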